Mapping of Sequence Reads to the Reference Genomes ◾ 55
2.2 READ MAPPING
Sequence alignment is one of the most important tasks in bioinformatics. It has been used
for decades before the recent revolution in genomics which erupted after the discovery
of the high-throughput sequencing technologies. It has been used in numerous bioinfor-
matics applications. The basic alignment algorithms are the global sequence alignment
using Needleman-Wunsch [1, 2] algorithm and the local sequence alignment using Smith–
Waterman (SW) [3] algorithm. Both these two legacy algorithms use dynamic programing,
which was invented by the mathematician Richard Bellman in early 1950s. The dynamic
programming in sequence alignment, which has nothing to do with computer program-
ming, is an algorithm that attempts to find the optimal alignment of two sequences by
dividing the whole alignment into smaller sub-alignments. The global sequence align-
ment is to align two sequences from end to end, and therefore, gaps can be introduced to
stretch the sequences to be equal in length. It is obvious that this type of alignment does
not work for aligning the massive number of short sequences to a genome sequence that
may span millions of bases. The second type is the local sequence alignment, which aligns
two sequences only to the regions of similarity. It is quite clear that this kind of sequence
alignment will work for aligning reads to a reference genome. The local sequence align-
ment in its crude form is exhaustive in a way that if you a have a single sequence and you
want to align it to multiple sequences, the algorithm will exhaust all possible alignments
and then report the results. In most cases, this approach is not practical for aligning mil-
lions of reads and it is computationally expensive with that massive number of genomics
sequences. The Basic Local Alignment Search Tool or BLAST [4] is heuristic version of the
SW local sequence alignment algorithm, which is more time-efficient than the exhaustive
approach because it searches only for the most significant patterns in the sequences and
performs sequence alignment accordingly. BLAST is fast but it may skip some alignments
because of its heuristic nature. Now after all, we can ask the question: Is BLAST suitable for
aligning millions of short-read sequences to a reference genome that may span millions of
bases? To provide a practical answer to this question, assume that BLAST could align a sin-
gle read to the reference genome in a second; if we have 20 million reads, BLAST will align
them to the reference genome in 20 million seconds (231.5 days), if we ignore the possibility
of the power or Internet outage or any other natural disaster. In conclusion, BLAST is not
efficient for read sequence alignment. Considering the large size of the reference genome
sequences and the large number of reads, one of the biggest challenges to perform search-
ing and locating the mapped region with high accuracy is indexing. Efficient indexing algo-
rithms are needed to organize the sequence of the reference genome and the short reads in
memory efficiently and to facilitate fast searching for patterns. Computer scientists play a
key role in developing data structures that allow computer programs to store and process
data effectively with less computational costs. Efficient data structures will help in both
indexing the sequence of a reference genome and fast searching of a read in an indexed
sequence with efficient memory usage. Several data structures for indexing were proposed
and implemented on software packages for read sequence alignment. The most com-
monly used data structures for indexing include suffix tree, suffix array, Burrows–Wheeler